An Information Theoretic Model for Database Alignment

Authors

  • Patrick Pantel
  • Andrew Philpot
  • Eduard H. Hovy
Abstract

As with many large organizations, the Government's data is split in many different ways and is collected at different times by different people. The resulting massive data heterogeneity means that government staff cannot effectively locate, share, or compare data across sources, let alone achieve computational data interoperability. The premise of our research is that the manual labor required in database wrapping and integration can be significantly reduced by automatically learning mappings in the data. In this research, we applied statistical algorithms to discover column correspondences across environmental databases. We have seen particular success with an information-theoretic model, which we call SIfT, that performs data-driven column alignments. We applied SIfT to map the Santa Barbara and Ventura County Air Pollution Control Districts’ 2001 and 2002 emissions inventory databases to the California Air Resources Board statewide inventory database. SIfT yielded 75% precision and 72.2% recall on the column alignment task. On the task of integrating new district data with the statewide database, we achieved 55% accuracy for Ventura County and 59% accuracy for Santa Barbara County.
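To make the idea of a data-driven, information-theoretic column alignment concrete, here is a minimal Python sketch. It is not the SIfT implementation, which the abstract does not specify. It assumes one plausible scheme: each column is described by the pointwise mutual information (PMI) between the column and the values it contains, and columns are matched by cosine similarity of those PMI vectors. The function names (`pmi_vectors`, `cosine`, `align_columns`, `precision_recall`), the threshold, and the toy data are all hypothetical.

```python
import math
from collections import Counter


def pmi_vectors(columns):
    """Map each column key to {value: PMI(column, value)} over the pooled data."""
    counts = {key: Counter(values) for key, values in columns.items()}
    total = sum(sum(c.values()) for c in counts.values())
    col_totals = {key: sum(c.values()) for key, c in counts.items()}
    val_totals = Counter()
    for c in counts.values():
        val_totals.update(c)
    vectors = {}
    for key, c in counts.items():
        # PMI = log( p(col, val) / (p(col) * p(val)) )
        vectors[key] = {
            v: math.log((n * total) / (col_totals[key] * val_totals[v]))
            for v, n in c.items()
        }
    return vectors


def cosine(u, v):
    """Cosine similarity of two sparse vectors stored as dicts."""
    dot = sum(w * v[k] for k, w in u.items() if k in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def align_columns(source, target, threshold=0.0):
    """For each source column, pick the most similar target column.

    A source column is left unaligned if no target scores above `threshold`
    (an assumed cutoff, not a value from the paper).
    """
    pooled = {("src", name): vals for name, vals in source.items()}
    pooled.update({("tgt", name): vals for name, vals in target.items()})
    vectors = pmi_vectors(pooled)
    alignment = {}
    for s in source:
        scored = [(cosine(vectors[("src", s)], vectors[("tgt", t)]), t) for t in target]
        best_score, best_target = max(scored)
        if best_score > threshold:
            alignment[s] = best_target
    return alignment


def precision_recall(predicted, gold):
    """Precision and recall of predicted (source, target) pairs against gold pairs."""
    pred_pairs = set(predicted.items())
    gold_pairs = set(gold.items())
    correct = len(pred_pairs & gold_pairs)
    precision = correct / len(pred_pairs) if pred_pairs else 0.0
    recall = correct / len(gold_pairs) if gold_pairs else 0.0
    return precision, recall


# Toy data, invented for illustration only.
district = {
    "facility_city": ["Ventura", "Oxnard", "Ventura"],
    "pollutant": ["NOX", "CO", "NOX"],
}
statewide = {
    "CITY": ["Ventura", "Oxnard", "Ojai"],
    "POL": ["CO", "NOX", "SOX"],
}
alignment = align_columns(district, statewide)
print(alignment)
print(precision_recall(alignment, {"facility_city": "CITY", "pollutant": "POL"}))
```

In this sketch, columns that share no values score zero and are never aligned, so the approach depends entirely on overlapping data rather than on column names. The precision and recall helper mirrors how an alignment could be scored against a hand-built gold standard, in the spirit of the figures reported above.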


Similar resources

An Information-Theoretic Discussion of Convolutional Bottleneck Features for Robust Speech Recognition

Convolutional Neural Networks (CNNs) have shown their effectiveness in speech recognition systems, both for feature extraction and for acoustic modeling. In addition, CNNs have been used for robust speech recognition, and competitive results have been reported. A Convolutive Bottleneck Network (CBN) is a kind of CNN that has a bottleneck layer among its fully connected layers. The bottleneck fea...

UCLA Research on Information Management for Bioinformatics

This research initially focused on construction of a graph database for bioinformatics — a database system allowing storage and retrieval of labeled, directed graphs. Intuitively, graph databases seem well-suited for applications like bioinformatics that involve complex, unstructured information in a fundamental way. However, it is difficult to translate graph data management research into bioi...

A generalization of Profile Hidden Markov Model (PHMM) using one-by-one dependency between sequences

The Profile Hidden Markov Model (PHMM) can be poor at capturing dependency between observations because of the statistical assumptions it makes. To overcome this limitation, the dependency between residues in a multiple sequence alignment (MSA) which is the representative of a PHMM can be combined with the PHMM. Based on the fact that sequences appearing in the final MSA are written based on th...

Combination of real options and game-theoretic approach in investment analysis

Technology investments account for a large share of capital investment by major companies. Assessing such investment projects is critical to the efficient allocation of resources. Viewing investment projects as real options, this paper expands a method for assessing technology investment decisions under the combined presence of uncertainty and competition. It combines the game-theore...

Addressing the Causes and Failure for Financial Transformation while Achieving Business Alignment

The financial transformation journey is often addressed by trying to avoid the pitfalls associated with the causes of failure while leveraging the critical success factors. At best, Chief Financial Officers adopting this approach are likely to improve the degree of customer service experienced by the Finance department. This is unlikely to lead to sustainable financial transformation being ach...


Publication date: 2005